Today we’ll be working with the diamonds dataset from the ggplot2 package. We want to understand how various features of a diamond influence its price. We will be looking specifically at carat and cut.
Let’s load the ggplot2 package and the diamonds dataset. (Install the package with install.packages("ggplot2") if you have not done so yet.) Look at the documentation to understand what the dataset is about.
library(ggplot2)
data(diamonds)
?diamonds
As usual, we can use str(), head() or View() to see the dataset:
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Histograms are a good way of understanding the distribution of a single variable. In this dataset, the variable that is probably of greatest interest is price. Let’s plot a histogram of price to understand its distribution:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2 defaults to 30 bins for the histogram. We can change this by adding a bins argument to geom_histogram(). As you can see from the histograms below, different numbers of bins can give very different impressions of the data!
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), bins = 2)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), bins = 100)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), bins = 1000)
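The message printed with the first histogram suggests binwidth as an alternative to bins. Instead of fixing the number of bins, it fixes the width of each bin in the units of the variable. Here is a minimal sketch; the width of 500 (in price units) and the name p_binwidth are arbitrary choices for illustration:

```r
library(ggplot2)

# binwidth is given in the units of the x variable (here, price),
# so each bar covers a price range of 500. The value 500 is an
# illustrative choice, not a recommendation.
p_binwidth <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 500)
p_binwidth
```

Specifying a bin width also silences the stat_bin() message, since we are no longer relying on the default.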
Since price is an important variable, we want to understand which characteristics of a diamond affect it and how.
A first guess would be that the weight of a diamond, indicated by carat, would heavily influence price. Let’s make a scatterplot of price vs. carat:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Wow, what a mess! That’s because we have so many data points being plotted over each other (this is called overplotting). Are there more diamonds in the 0-1 carat range or the 2-3 carat range? It’s hard to tell. One way to address this is to modify the transparency of each point by adjusting “alpha”. By default, alpha = 1, which means fully opaque. We can reduce alpha (as a rule of thumb, alpha = 0.05 means that roughly 20 overlapping points are needed to look fully opaque):
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 0.05)
There’s still a fair bit of overplotting going on, but some characteristics of the data become more obvious. For example, the carat sizes of diamonds seem to bunch up around certain values (e.g. just above 1, 1.5, 2). This may be worth investigating.
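One way to investigate the bunching is a histogram of carat with very narrow bins. The sketch below uses a bin width of 0.01, on the assumption that carat is recorded to two decimal places (as the values shown by head() suggest); the names p_carat and carat_counts are ours:

```r
library(ggplot2)

# A narrow bin width (0.01 carat) makes spikes at particular
# carat values visible.
p_carat <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
p_carat

# The same information as a table: the ten most common carat values.
carat_counts <- sort(table(diamonds$carat), decreasing = TRUE)
head(carat_counts, 10)
```

Tall spikes at or just above round values would support the impression from the scatterplot.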
Instead of filled circles, we could change the shape of the points manually through the shape argument (see this reference for which symbols correspond to each shape value):
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 0.05, shape = 4)
It’s debatable whether changing the shape made the plot any clearer.
Is there a relationship between price and carat? It does seem so. We can add a geom_smooth() layer that tries to determine the relationship between the two:
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.05) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The heavier the diamond, the more expensive it is. At the same time, we see quite a wide spread of prices for diamonds of the same weight, indicating that there are probably other factors at play.
Looking at the dataset, we might guess that cut is an important factor in determining the price of a diamond as well. Let’s try a scatterplot:
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_point()
That’s not informative at all! We see a lot of overplotting going on. Let’s use the alpha trick that we used previously:
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_point(alpha = 0.05)
In this case, changing alpha on its own is not going to help much, since all the points are still going to lie on one vertical line. Let’s add jitter (i.e. move the points by a small random amount) to get a better view.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_point(alpha = 0.05, position = "jitter")
Because jittering is such a common operation, instead of adding position = "jitter" as an argument to geom_point(), we can use geom_jitter() directly to get the same plot:
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_jitter(alpha = 0.05)
This is slightly better in that we start to see some trends, but there is still a lot of overplotting: look at the concentration of black dots near the bottom of the plot.
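The amount of jitter can also be set explicitly through the width and height arguments of geom_jitter(). The sketch below spreads the points horizontally only, so that the prices themselves are not perturbed; the value 0.3 and the name p_jitter are illustrative choices:

```r
library(ggplot2)

# width controls the horizontal spread within each cut category;
# height = 0 leaves the y values (prices) exactly as recorded.
p_jitter <- ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_jitter(alpha = 0.05, width = 0.3, height = 0)
p_jitter
```

This does not remove the overplotting at low prices, but it guarantees that any vertical pattern we see comes from the data and not from the random noise.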
Instead of plotting all the data points, we can use boxplots or violin plots to look at summary statistics instead:
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_violin()
Interesting! The bulk of the distribution of prices is roughly the same, no matter what the cut is. In fact, from the violin plot, there seem to be a lot of diamonds of ideal cut which have very low prices!
It seems unintuitive that the cut of a diamond does not affect its price, and that diamonds of ideal cut have lower prices. Could there be other factors at work? One possibility is that there just aren’t many large diamonds of ideal cut: thus, a diamond of ideal cut tends to weigh less (smaller in carat size), and hence fetches a lower price.
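A quick numeric check of this conjecture is to compare the typical carat for each cut. This is a minimal sketch using base R’s aggregate(); the name carat_by_cut is ours, and the median is just one reasonable summary:

```r
library(ggplot2)  # provides the diamonds dataset

# Median carat for each cut.
carat_by_cut <- aggregate(carat ~ cut, data = diamonds, FUN = median)

# Number of diamonds in each cut, matched to the rows above by name.
carat_by_cut$n <- as.vector(table(diamonds$cut)[as.character(carat_by_cut$cut)])
carat_by_cut
```

If diamonds of ideal cut tend to be lighter, their median carat should come out lower than that of the poorer cuts.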
We can explore this theory by modifying other aesthetics of our original scatterplot. For example, we can let the color of each dot signify its cut:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2)
There seem to be more yellow dots (better cuts) on top and more purple dots (poorer cuts) below, lending credence to the intuitive assumption that, for a given carat, a better cut fetches a higher price.
In this case, changing the color of the dots helped us to understand the data better. Choosing which aesthetics to modify is an important skill to learn. For example, it would have been a bad idea to relate the cut of a diamond to its size or shape:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, size = cut)) +
geom_point(alpha = 0.2)
ggplot(data = diamonds, mapping = aes(x = carat, y = price, shape = cut)) +
geom_point(alpha = 0.2)
## Warning: Using shapes for an ordinal variable is not advised
Let’s go back to our colored plot. The colors here are the ggplot2 defaults. We can introduce our own color scale with scale_color_brewer() to make the plot more informative (the full list of color palettes can be found through a Google image search):
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_color_brewer(palette = "YlOrRd")
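The palettes used by scale_color_brewer() come from ColorBrewer, and they can also be listed from within R. The sketch below assumes the RColorBrewer package is installed (it appears among the loaded namespaces in the session info at the end of this post):

```r
library(RColorBrewer)

# Table of all ColorBrewer palettes: maximum number of colors,
# category ("seq", "div" or "qual") and colorblind-friendliness.
brewer.pal.info

# Draw every palette in the plotting window.
display.brewer.all()

# The five colors that "YlOrRd" provides for the five levels of cut.
brewer.pal(n = 5, name = "YlOrRd")
```

Sequential palettes such as "YlOrRd" are a natural fit for an ordered variable like cut.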
There’s still a fair amount of overplotting going on. Can we have separate graphs of price vs. carat for each cut?
This is called splitting the plot into facets. ggplot2 allows us to do this with the function facet_wrap(). Use the following code to facet the plot by a single variable:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
facet_wrap(~ cut)
By default, ggplot2 put just 3 subplots in each row here. We can change this by adding an nrow argument to facet_wrap():
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
facet_wrap(~ cut, nrow = 1)
Faceting didn’t help too much in this case, since the plots for the better cuts look very similar to one another. Perhaps we could add a smoothing layer to the original plot:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
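As the message shows, geom_smooth() chose a GAM on its own. A different smoother can be requested through the method argument. The sketch below fits a straight line for each cut and hides the confidence band with se = FALSE; the linear fit is an assumption made for illustration, not a claim that price is linear in carat:

```r
library(ggplot2)

# method = "lm" fits one straight line per cut (because col = cut
# groups the data); se = FALSE hides the confidence band.
p_lm <- ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  geom_smooth(method = "lm", se = FALSE)
p_lm
```

Straight lines are easier to compare across cuts than wiggly curves, at the cost of ignoring any curvature in the relationship.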
As you can probably see, the possibilities are endless! You can try plotting different variables against each other and see if you get anything interesting.
If we want to facet by more than 1 variable, we can do so with facet_grid(). The variable before the ~ sign will be split on the rows, while the variable after the ~ sign will be split on the columns:
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.2) +
facet_grid(cut ~ color)
Let’s say you’re satisfied with the scatterplot of price vs. carat with color denoting cut, and that you want to share it with others. The first thing you should do is label your axes and give your plot a title:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")
The size of the labels seems a bit small. We can adjust them using the theme() function. Let’s center the plot title at the same time:
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
axis.title = element_text(size = rel(1.2)))
We can move the legend around by setting a legend.position argument in theme() (possible options are “none”, “left”, “right”, “bottom”, “top”):
ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
axis.title = element_text(size = rel(1.5)),
legend.position = "bottom")
For a full (long!) list of attributes which can be modified, see this reference.
To save a plot, click on the button, and click “Save as Image…” You can adjust the size of your image in the pop-up before saving it.
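A plot can also be saved from code with ggsave(), which makes the result reproducible. In the sketch below, the file name, the dimensions and the use of tempdir() are illustrative assumptions; in practice you would point out_file at wherever you want the image to go:

```r
library(ggplot2)

p_save <- ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")

# Hypothetical output location: a temporary directory is used here so
# that the sketch does not write into your working directory.
out_file <- file.path(tempdir(), "price_vs_carat.png")

# The file type is inferred from the extension; width and height are
# in inches, and dpi sets the resolution.
ggsave(filename = out_file, plot = p_save,
       width = 8, height = 5, units = "in", dpi = 300)
```

Passing the plot explicitly through the plot argument avoids relying on ggsave()’s default of saving the last plot displayed.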
It seems tedious to change these attributes for each graph we make. The nice thing about ggplot2 is that it lets us assign each part of the plot to a variable! For example, we could have reproduced the plot above using this code:
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
geom_point(alpha = 0.2) +
scale_colour_brewer(palette = "YlOrRd") +
labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")
th <- theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
axis.title = element_text(size = rel(1.2)),
legend.position = "bottom")
p # plot without the theme changes
p + th
We can now apply these adjustments to any plot we want by adding + th at the end of the code:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price)) +
labs(title = "Histogram of price", x = "Price", y = "Count") +
th
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 RColorBrewer_1.1-2 pillar_1.3.0
## [4] compiler_3.5.1 plyr_1.8.4 bindr_0.1.1
## [7] tools_3.5.1 digest_0.6.15 viridisLite_0.3.0
## [10] lattice_0.20-35 nlme_3.1-137 evaluate_0.10.1
## [13] tibble_1.4.2 gtable_0.2.0 mgcv_1.8-24
## [16] pkgconfig_2.0.1 rlang_0.2.1 Matrix_1.2-14
## [19] cli_1.0.0 yaml_2.1.19 bindrcpp_0.2.2
## [22] withr_2.1.2 dplyr_0.7.6 stringr_1.3.1
## [25] knitr_1.20 rprojroot_1.3-2 grid_3.5.1
## [28] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2
## [31] fansi_0.2.3 rmarkdown_1.10 reshape2_1.4.3
## [34] purrr_0.2.5 magrittr_1.5 backports_1.1.2
## [37] scales_0.5.0 htmltools_0.3.6 assertthat_0.2.0
## [40] colorspace_1.3-2 labeling_0.3 utf8_1.1.4
## [43] stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0
## [46] crayon_1.3.4